We chose the NYC Open Dataset for NYPD Motor Vehicle Collisions . This is an open source dataset updated on a daily basis that tracks all road accidents in New York City. We want to use this dataset to infer trends about the various factors that cause accidents. A key question we wish to answer is Can we provide insights to help make the streets of NYC safer? Understanding the various reasons for accidents can certainly help us answer this question. For example, one of the things we want to identify are the “hotspots” - intersections, streets etc where most accidents occur. This is could be a useful insight for authorities to take to necessary actions to remediate this.
Location: What boroughs have the highest distribution of accidents? Which locations (intersections, streets) are the hotspots for accidents? Why do some zip codes have more accidents?
Time: What time of the day is prone to accidents? Which kinds of accidents are more likely to happen during a certain time of day.(eg. are pedestrians more likely to get hit at night?) How has accident frequency and deaths changed over time (months and years)?
Accident Information: What are the main factors contributing to accidents? What are the types of Vehicles that cause more accidents? Is there a relation between the kind of vehicle and the accident location? What are the causes of accident in different vehicle types? Which mode of transportation is more likely to get involved in an accident? What are the main reasons behind accidents at hotspots (ex. inattentive driver?)
Fun Misc: What has been going on around Columbia University campus? Any intersections around Columbia we need to be wary of? Am I safer if I walk on Amsterdam instead of Broadway? Which neigborhoods have high accident rates?
The data was collected from the NYC Open Data website. It provides information on the exact location, borough, time, cause, type of vehicles, number of deaths and injuries for every accident. The data is from 2012 to the first quarter of 2018. This data is collected by the police department (NYPD) because the NYC Council passed a local law in 2011. The data is manually run every month and reviewed by the TrafficStat unit before being posted on the NYPD website.
Each record represents a collision in NYC by borough, precinct and cross street. The data can potentially provide insights into how dangerous or safe certain intersections in NYC are. The information is available in both pdf and excel formats.
We downloaded the CSV file from the NYC Open Data webpage for this dataset. The dataset is updated on a daily basis (during weekdays). For the sake of uniformity and consistency of analyses, we used the accident data collected upto March 13, 2018.
Features: The dataset is 1.22 million rows and 29 columns (260 MB). The columns cover key aspects of every accident: Date, Time, Borough, Zip Code, Latitude, Longitude, Location, On Street Name, Cross Street Name, Number of Persons Injured, Number of Persons Killed, Number of Pedestrians Injured, Number of Pedestrians Killed, Number of Cyclist Injured, Number of Cyclist Killed, Number of Motorist Injured, Number of Motorist Killed etc.
library(tidyverse)
library(ggmap)
library(choroplethrZip)
accidents <- read_csv("NYPD_Motor_Vehicle_Collisions.csv")
names(accidents)
## [1] "DATE" "TIME"
## [3] "BOROUGH" "ZIP CODE"
## [5] "LATITUDE" "LONGITUDE"
## [7] "LOCATION" "ON STREET NAME"
## [9] "CROSS STREET NAME" "OFF STREET NAME"
## [11] "NUMBER OF PERSONS INJURED" "NUMBER OF PERSONS KILLED"
## [13] "NUMBER OF PEDESTRIANS INJURED" "NUMBER OF PEDESTRIANS KILLED"
## [15] "NUMBER OF CYCLIST INJURED" "NUMBER OF CYCLIST KILLED"
## [17] "NUMBER OF MOTORIST INJURED" "NUMBER OF MOTORIST KILLED"
## [19] "CONTRIBUTING FACTOR VEHICLE 1" "CONTRIBUTING FACTOR VEHICLE 2"
## [21] "CONTRIBUTING FACTOR VEHICLE 3" "CONTRIBUTING FACTOR VEHICLE 4"
## [23] "CONTRIBUTING FACTOR VEHICLE 5" "UNIQUE KEY"
## [25] "VEHICLE TYPE CODE 1" "VEHICLE TYPE CODE 2"
## [27] "VEHICLE TYPE CODE 3" "VEHICLE TYPE CODE 4"
## [29] "VEHICLE TYPE CODE 5"
head(accidents)
Noteworthy features: It is interesting how much detail NYPD has gone into while recording this information about each accident, so much that upto 5 vehicles contributing to an accident are noted, along with their corresponding vehicle codes, the street intersections are noted along with exact latitude and longitude information. They’ve also recorded extra information such as the number of cyclists and pedestrians killed as a result of the collision/accident.
We started with the analysis of missing values across the variables. Our first step was to use the skim function in the skimr package to get the percentage of missing values across the variables.
skimr::skim(accidents) %>% filter(stat == "missing") %>%
arrange(desc(value)) %>% select(variable, type, value) %>% mutate(percent = value/nrow(accidents))
We observe that variables like CONTRIBUTING FACTOR VEHICLE 5, VEHICLE TYPE CODE 5 etc have the highest missing values - over 99%. These variables correspond to description of accidents involving as high as 5 vehicles, which would happen in very few cases.
It is noteworthy that variables like BOROUGH and ZIP CODE have about 28% missing values.
Since most of the observations do not have variables CONTRIBUTING FACTOR VEHICLE 5, VEHICLE TYPE CODE 5 etc, we decided to drop them. To avoid loss of useful information from this, we created a new binary variable which indicates if 3 or more vehicles were involved in an accident.
accidents <- accidents %>% mutate(VEHICLES_INVOLVED_GTE3 = 1 * !is.na(`CONTRIBUTING FACTOR VEHICLE 3`))
accidents <- accidents %>% select(-c(`VEHICLE TYPE CODE 5`,`CONTRIBUTING FACTOR VEHICLE 5`,
`VEHICLE TYPE CODE 4`,`CONTRIBUTING FACTOR VEHICLE 4`,
`VEHICLE TYPE CODE 3`,`CONTRIBUTING FACTOR VEHICLE 3`,
`OFF STREET NAME`))
Our next step is to identify if the values are missing at random. The visna function in the extracat library is useful for that.
library(extracat)
visna(accidents)
We observe that several variables do not have any missing values. It is more informative to do the visna considering a subset of the variables that have missing values.
accidents_sub <- accidents %>% select(BOROUGH,`ZIP CODE`,LOCATION,
`ON STREET NAME`, `CROSS STREET NAME`,
`CONTRIBUTING FACTOR VEHICLE 1`,`VEHICLE TYPE CODE 1`)
visna(accidents_sub)
We wanted to see how the quality of the data has been changing over time. Our initial hypothesis for the missing values is that the data collection process might not have been great when the NYC Open Data portal started and might have imporved over the years. However, the data suggests otherwise:
accidents <- accidents %>%
mutate(DATE = as.Date(DATE, format='%m/%d/%Y')) %>%
arrange(DATE) %>%
mutate(YEAR = format(DATE, "%Y")) %>%
mutate(MONTH = format(DATE, "%m"))
accidents_missing <- accidents %>%
mutate(MISSING_BOROUGH=1*is.na(BOROUGH),
MISSING_CONTRIBUTION_INFO=1*is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%
select(MISSING_BOROUGH, MISSING_CONTRIBUTION_INFO, YEAR, BOROUGH)
missing_borough_by_year <- accidents_missing %>%
group_by(YEAR) %>% summarise(pc_missing_borough=mean(MISSING_BOROUGH))
missing_contrib_by_year <- accidents_missing %>%
group_by(YEAR) %>% summarise(pc_missing_contrib=mean(MISSING_CONTRIBUTION_INFO))
missing_contrib_by_borough <- accidents_missing %>%
group_by(BOROUGH) %>% summarise(pc_missing_contrib=mean(MISSING_CONTRIBUTION_INFO))
ggplot(data=missing_borough_by_year, aes(x=YEAR, y=pc_missing_borough)) + geom_col(fill="lightblue", color="black") +
xlab("Year") + ylab("% Missing Borough Information") +
ggtitle("% Missing Borough Information vs Year") +
theme(plot.title = element_text(hjust = 0.5))
It appears that the quality of the data has been deteriorating over time. 2012 had only about 23% missing Borough information while 2017 was close to 40%.
Let’s first look art the broad level of the different NYC boroughs and see where most accidents happen:
ggplot(data=subset(accidents, !is.na(BOROUGH)), aes(BOROUGH)) + geom_bar(fill="lightblue", color="black") + xlab("Borough") + ylab('Count of Accidents') + ggtitle('Accidents by Borough (2012 - 2018)') + theme(plot.title = element_text(hjust = 0.5))
It appears that Brooklyn is the borough with the most number of accidents. This is not surprising as it is the most populated one. We can get a better understanding of this by looking at the ratio of annual accidents to the populations of these boroughs.
# Accidents per 100,000 in all Boroughs
borough_per_1000 <- accidents %>%
group_by(`BOROUGH`) %>%
summarise(count_accidents = n()) %>%
filter(!is.na(`BOROUGH`)) %>%
select(`BOROUGH`, count_accidents) %>%
arrange(desc(count_accidents))
pops <- c(1643734, 2629150, 1455720, 2333054, 476015)
boroughs <- c("MANHATTAN","BROOKLYN","BRONX","QUEENS","STATEN ISLAND")
pop_table <- data_frame(POPS = pops, BOROUGH = boroughs)
accidents_summary <- borough_per_1000 %>% left_join(pop_table, by="BOROUGH") %>% mutate(count_accidents = count_accidents/POPS *1000)
ggplot(data=accidents_summary, aes(x=BOROUGH, y=count_accidents)) + geom_col(fill="lightblue", color="black") + xlab("Borough") + ylab('Accidents per 1000 residents') + ggtitle('Accidents per 1000 residents by Borough') +
theme(plot.title = element_text(hjust = 0.5))
This chart reveals a different picture compared to the prior chart. It is apparent that Manhattan has a 25% higher number of accidents per 1000 residents compared to Brooklyn - making it more dangerous.
We looked at a more granular level of aggregation with ZIP codes to identify neighborhoods with the most accidents. These can be plotted on a map using the choroplethrZip library. This library is not available on CRAN and has to be downloaded from GitHub.
library(choroplethrZip)
accidents_zip_other <- accidents %>% filter(!is.na(`ZIP CODE`)) %>%
group_by(`ZIP CODE`,BOROUGH) %>%
summarise(NUMBER_OF_ACCIDENTS=n())
names(accidents_zip_other) <- c("region", "BOROUGH","value")
accidents_zip_other$region <- factor(accidents_zip_other$region)
accidents_zip_other <- accidents_zip_other[!duplicated(accidents_zip_other$region),]
zip_choropleth(accidents_zip_other,
state_zoom = "new york",
county_zoom = c(36005,36047,36061,36081,36085),
title = "Accidents by Zip Code in New York City",
legend = "Number of Accidents") +
theme(plot.title = element_text(hjust = 0.5, size=15)) +
viridis::scale_fill_viridis(discrete = TRUE) + coord_map()
We can make a few quick observations from the chart:
Zooming in on Manhattan:
accidents_zip <- accidents %>% filter(!is.na(`ZIP CODE`)) %>% filter(BOROUGH=="MANHATTAN") %>% group_by(`ZIP CODE`) %>% summarise(NUMBER_OF_ACCIDENTS=n())
names(accidents_zip) <- c("region", "value")
accidents_zip$region <- factor(accidents_zip$region)
zip_choropleth(accidents_zip,
state_zoom = "new york",
county_zoom = 36061,
title = "Accidents by Zip Code in Manhattan",
legend = "Number of Accidents") +theme(plot.title = element_text(hjust = 0.5, size=15))+viridis::scale_fill_viridis(name="# Accidents",discrete = TRUE)+coord_map()
The ZIP codes which have the highest accidents and fatalities are the following:
zip_count <- accidents %>%
group_by(`ZIP CODE`) %>%
summarise(count_accidents = n()) %>%
filter(!is.na(`ZIP CODE`)) %>%
select(`ZIP CODE`, count_accidents) %>%
arrange(desc(count_accidents)) %>%
mutate(`ZIP CODE` = as.character(`ZIP CODE`))
zip_count_top10 <- zip_count[1:10,]
dotchart(zip_count_top10$count_accidents, labels = zip_count_top10$`ZIP CODE`,
pch=19, xlab = "Accident Count", groups = -zip_count_top10$count_accidents,
xlim=c(0,16000), ylab = "ZIP code",
main = "Accidents - Top 10 ZIP codes")
We also looked at the top 10 ZIP codes for fatalities:
zip_count_deaths <- accidents %>%
group_by(`ZIP CODE`) %>%
summarise(count_deaths = sum(`NUMBER OF PERSONS KILLED`)) %>%
filter(!is.na(`ZIP CODE`)) %>%
select(`ZIP CODE`, count_deaths) %>%
arrange(desc(count_deaths)) %>%
mutate(`ZIP CODE` = as.character(`ZIP CODE`))
zip_count_deaths_top10 <- zip_count_deaths[1:10,]
dotchart(zip_count_deaths_top10$count_deaths, labels = zip_count_deaths_top10$`ZIP CODE`,
pch=19, xlab = "Deaths", xlim=c(0,30), main = "Deaths - Top 10 ZIP codes", ylab = "ZIP code")
To get a more granular view of where most accidents happen, we superimposed a heatmap on top of the NYC map to identify specific roads and interesctions with most accidents. The NYC map was obtained using the handy ggmap library.
nycmap <- qmap("new york", zoom = 11, color = "bw")
nycmap + stat_bin2d(aes(x = LONGITUDE, y = LATITUDE), size = .5, bins = 150,
alpha = 0.5, data = accidents) + scale_fill_continuous(low="blue", high="red") +
ggtitle("Accident Density by Location") + theme(plot.title = element_text(hjust = 0.5))
The heatmap is lit up brightly for midtown and lower Manhattan. There are also hotspots in Brooklyn (along Atlantic Avenue) which are brightly lit. This chart illustrates why the accidents/residents is the highest for Manhattan.
We tried to understand the trend in accidents over time. We started by looking at daily accidents over time (over the span of 5 years). This is obviously expected to be noisy, so we added a trend curve using the non-parametric loess smoothing.
accidents_per_day <- accidents %>%
group_by(DATE, BOROUGH) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`),
NUMBER_OF_INJURIES = sum(`NUMBER OF PERSONS INJURED`)) %>%
filter(NUMBER_OF_ACCIDENTS<400) # Remove one outlier point
ggplot(accidents_per_day, aes(x=DATE, y=NUMBER_OF_ACCIDENTS)) + geom_line(aes(colour="Accidents")) +
geom_smooth(method = "loess", span = .2, se = TRUE, show.legend = TRUE, aes(colour="Trend")) +
scale_colour_manual(name="Legend", values=c("black", "blue")) +
xlab("Time") + ylab("Number of Daily Accidents") +
ggtitle("Daily Accidents") +
theme(plot.title = element_text(hjust = 0.5))
There are a few quick observations we can make from this chart:
To better understand the periodicity aspect of the data, we looked at the number of accidents aggregated over a month level.
accidents_by_month <- accidents %>% filter(YEAR > 2012) %>% filter(YEAR < 2018) %>%
group_by(MONTH) %>% summarise(num_accidents = n())
ggplot(accidents_by_month, aes(x=MONTH, y=num_accidents)) +
geom_bar(stat="identity", fill="lightblue", color="black")+
xlab("Month") + ylab("Total Accidents") +
ggtitle("Accidents by month") +
theme(plot.title = element_text(hjust = 0.5))
The chart shows a dip in the accidents for the first few months of the year. There seem to be fewer accidents in winter. We also tried to plot the same information in a polar coordinate plot to give a better sense of the seasonality.
ggplot(accidents_by_month, aes(x=MONTH, y=num_accidents)) +
geom_bar(stat="identity", fill="lightblue", color="black")+
xlab("Month") + ylab("Total Accidents") +
ggtitle("Accidents by month") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_polar(start=0)
After time of the year, we were interested in looking at trends by time of the day. We aggregated the data by the hour and looked at the accidents. Like before, we made both the cartesian bar charts as well as the polar bar charts.
accidents_by_timeofday <- accidents %>%
mutate(HOUR = format(as.POSIXct(TIME), "%H")) %>%
group_by(HOUR, BOROUGH) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`),
NUMBER_OF_INJURIES = sum(`NUMBER OF PERSONS INJURED`))# %>%
# filter(NUMBER_OF_ACCIDENTS<400) # Remove one outlier point
ggplot(accidents_by_timeofday, aes(x=HOUR, weights=NUMBER_OF_ACCIDENTS)) +
geom_bar(stat="count", fill="lightblue", color="black") +
xlab("Time") + ylab("Accidents") +
ggtitle("Accidents by Hour") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(accidents_by_timeofday, aes(x=HOUR, weights=NUMBER_OF_ACCIDENTS)) +
geom_bar( stat="count", fill="lightblue", color="black") +
xlab("Time") + ylab("Accidents") +
ggtitle("Accidents by Hour") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_polar()
The above chart is a polar coordinate representation of the bar chart preceding it. The Cartesian bar chart is significantly easier to read and draw inferences from. The differneces in values are linear and intuitive in the Cartesian chart. However it fails to capture the cyclical nature of months (December looping over to January) and time of day (23 to 0) which is better captured by the Polar bar chart.
We analyzed the different contributing factors leading to accidents over the years. This is captured by the CONTRIBUTING FACTOR VEHICLE 1 and CONTRIBUTING FACTOR VEHICLE 2 variables. They correspond to the reason for vehicle 1 and Vehicle 2 respectively. These variables have a large number of missing values:
print("Number of missing values in `CONTRIBUTING FACTOR VEHICLE 1`")
## [1] "Number of missing values in `CONTRIBUTING FACTOR VEHICLE 1`"
print(sum(is.na(accidents$`CONTRIBUTING FACTOR VEHICLE 1`)))
## [1] 6297
print("Number of missing values in `CONTRIBUTING FACTOR VEHICLE 2`")
## [1] "Number of missing values in `CONTRIBUTING FACTOR VEHICLE 2`"
print(sum(is.na(accidents$`CONTRIBUTING FACTOR VEHICLE 2`)))
## [1] 170620
CONTRIBUTING FACTOR VEHICLE 1 has much fewer missing values than CONTRIBUTING FACTOR VEHICLE 2. Analyzing them separately for the purpose of better analyses
checkContrFact1_2 <- accidents %>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`) & !is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>% filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified" & `CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>% mutate(sameValue = `CONTRIBUTING FACTOR VEHICLE 1`==`CONTRIBUTING FACTOR VEHICLE 2`)
print("Number of records where `CONTRIBUTING FACTOR VEHICLE 1` equals `CONTRIBUTING FACTOR VEHICLE 2`:")
## [1] "Number of records where `CONTRIBUTING FACTOR VEHICLE 1` equals `CONTRIBUTING FACTOR VEHICLE 2`:"
print(sum(checkContrFact1_2$sameValue))
## [1] 98407
Upon analyzing CONTRIBUTING FACTOR VEHICLE 1 and CONTRIBUTING FACTOR VEHICLE 2 columns after omitting records containing NAs and also only considering records where both CONTRIBUTING FACTOR VEHICLE 1 and CONTRIBUTING FACTOR VEHICLE 2 are specified, we observe that 98407 records out of 160161 have the same entry for these two columns.
We plot the top 10 contributing factors for Vehicle 1.
accidents_by_reason_1 <- accidents %>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) %>% filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified") %>% group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
accidents_by_reason_1 <- head(accidents_by_reason_1, n = 10)
ggplot(accidents_by_reason_1, aes(x=reorder(`CONTRIBUTING FACTOR VEHICLE 1`, `NUMBER_OF_ACCIDENTS`),
y=`NUMBER_OF_ACCIDENTS`)) +
geom_col(fill="lightblue", color="black") +
xlab("Contributing Factor Vehicle 1") +
ylab("Number of Accidents")+
ggtitle("Top 10 First Contributing Factors by Accident Count")+
theme(plot.title = element_text(hjust = 0.5))+coord_flip()
accidents_by_reason_2 <- accidents %>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>% filter(`CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>% group_by(`CONTRIBUTING FACTOR VEHICLE 2`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
accidents_by_reason_2 <- head(accidents_by_reason_2, n = 10)
ggplot(accidents_by_reason_2, aes(x=reorder(`CONTRIBUTING FACTOR VEHICLE 2`, `NUMBER_OF_ACCIDENTS`),
y=`NUMBER_OF_ACCIDENTS`)) +
geom_col(fill="lightblue", color="black") +
xlab("Contributing Factor Vehicle 2")+ylab("Number of Accidents")+
ggtitle("Top 10 Second Contributing Factors by Accident Count")+
theme(plot.title = element_text(hjust = 0.5))+coord_flip()
We used a heatmap to understand the combination of Contributing Factor for Vehicle 1 and Contributing factor for Vehicle 2 resulting in most accidents.
accidents_by_vehicles <- accidents %>% filter(!is.na(`VEHICLE TYPE CODE 1`) & !is.na(`VEHICLE TYPE CODE 2`)) %>% filter(`VEHICLE TYPE CODE 1` != "UNKNOWN" & `VEHICLE TYPE CODE 2` != "UNKNOWN") %>% group_by(`VEHICLE TYPE CODE 1`,`VEHICLE TYPE CODE 2`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
accidents_by_vehicles <- head(accidents_by_vehicles, n = 50)
ggplot(accidents_by_vehicles, aes(x = `VEHICLE TYPE CODE 1`, y = `VEHICLE TYPE CODE 2`, fill = NUMBER_OF_ACCIDENTS)) + theme(axis.text.x = element_text(angle = 60, hjust = 1))+geom_tile()+xlab("First Vehicle Type")+ylab("Second Vehicle Type")+ggtitle("Number of Accidents by combinations of Vehicle Types")+theme(plot.title = element_text(hjust = 0.5))+viridis::scale_fill_viridis()
We observe that most accidents occur when one or more drivers of the vehicles involved were distracted/inattentive.
We wanted to observe which locations in NYC were more likely to see the particular contributing factors we noted above. For this we focused only on Manhattan to get a closer view. We also looked only at the top 5 contributing factors.
accidents_by_reasons <- accidents %>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`) & !is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>% filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified" & `CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>% group_by(`CONTRIBUTING FACTOR VEHICLE 1`,`CONTRIBUTING FACTOR VEHICLE 2`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
accidents_by_reasons_map <- head(accidents_by_reasons, n=5)
accidents_topCF12 <- accidents %>%
filter(`CONTRIBUTING FACTOR VEHICLE 1` %in% accidents_by_reasons_map$`CONTRIBUTING FACTOR VEHICLE 1` & `CONTRIBUTING FACTOR VEHICLE 2` %in% accidents_by_reasons_map$`CONTRIBUTING FACTOR VEHICLE 2`) %>% filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>% filter(BOROUGH=='MANHATTAN')
qmap('manhattan', zoom = 12,color = 'bw', legend = 'topleft') + geom_point(aes(x=LONGITUDE, y=LATITUDE, color=`CONTRIBUTING FACTOR VEHICLE 1`), data=accidents_topCF12,alpha=0.6, size=1.5) + viridis::scale_color_viridis(discrete=TRUE)+ggtitle("Spatial View of Accidents by Top 5 Contributing Factors in Manhattan")+theme(plot.title = element_text(hjust = 0.5, size=15))
We observe an unusual pattern here - the different contributing factors have different “hotspots” across Manhattan. This is better illustrated by using 2D density plots faceted on contributing factor.
qmap('manhattan', zoom = 13,color = 'bw', legend = 'topleft') + geom_point(aes(x=LONGITUDE, y=LATITUDE, color=`NUMBER OF PERSONS KILLED`), data=accidents_topCF12,alpha=0.7, size=1.0) + viridis::scale_color_viridis()+ggtitle("Spatial View of Accidents by Top 5 Contributing Factors in Manhattan")+theme(plot.title = element_text(hjust = 0.5, size=28))+geom_density_2d(aes(x=LONGITUDE, y=LATITUDE, color=`NUMBER OF PERSONS KILLED`), data=accidents_topCF12, color="red")+facet_wrap(~ `CONTRIBUTING FACTOR VEHICLE 1`, nrow=3)+theme(panel.spacing.x=unit(2, "lines"),panel.spacing.y=unit(2, "lines"))+theme(strip.text.x = element_text(size = 25))